ITEA  Journal  29:  254-262 

Copyright  ©  2008  by  the  International  Test  and  Evaluation  Association 


Best  Practices  for  Reliability  Assessment  and  Verification 

Michael  J.  Cushing,  Ph.D. 

U.S.  Army  Evaluation  Center, 

U.S.  Army  Test  &  Evaluation  Command,  Aberdeen  Proving  Ground,  Maryland 

Margaret Hockenberry
U.S.  Army  Materiel  Systems  Analysis  Activity, 

U.S.  Research  Development  and  Engineering  Command,  Aberdeen  Proving  Ground,  Maryland 

E.  Andrew  Long 

Logistics  Management  Institute,  McLean,  Virginia 

The Department of Defense (DoD) is working closely with the Government Electronics and
Information Technology Association on the development of GEIA-STD-0009, Reliability
Program Standard for Systems Design, Development, and Manufacturing, at the behest of the
Defense Science Board Developmental Test Task Force. It is hoped that GEIA-STD-0009 will
improve the odds that military systems will successfully demonstrate reliability requirements in
both developmental and operational testing. This article provides an overview of
GEIA-STD-0009 along with initial guidance regarding its application, with an emphasis on the
assessment and verification of system reliability.

Key  words:  Developmental  testing;  failure  mode;  GEIA-STD-0009;  reliability; 
operational  load;  system  design;  system  reliability  modeling. 


During the past year, the U.S. Department of Defense (DoD)
has been working closely with both industry and the
Government Electronics and Information Technology
Association (GEIA) on the development of a new standard,
GEIA-STD-0009, Reliability Program Standard for
Systems Design, Development, and Manufacturing. The
DoD’s motivation for this undertaking is that many
systems are not achieving the required reliability during
developmental testing and are subsequently found
unsuitable during initial operational test and evaluation.
The Defense Science Board Developmental Test
(DT) and Evaluation Task Force examined this issue
and concluded that a new reliability program standard
is urgently needed. The purpose of this article is to
provide a brief overview of the new standard as well as
guidance regarding how to best assess and verify system
reliability with it.

GEIA-STD-0009  overview 

Embodied  in  GEIA-STD-0009  is  a  new  approach 
to  the  development,  production,  and  fielding  of  reliable 
systems.  As  depicted  in  Figure  1,  the  standard  is 
primarily  comprised  of  four  objectives: 


1.  understand customer/user requirements and constraints

2.  design  and  redesign  for  reliability 

3.  produce  reliable  systems/products 

4.  monitor  and  assess  user  reliability 

During the development of GEIA-STD-0009, the
Working Group identified the essential reliability
processes (termed “reliability activities” both in the
Standard and herein) that simply must be performed in
order to design, grow, build, and field reliable systems.
The reliability activities are mandatory in nature but
merely specify “what to do.”

GEIA-STD-0009,  at  its  core,  is  a  reliability 
engineering  and  growth  process  that  is  fully  integrated 
with  systems  engineering  as  depicted  in  Figure  2.  The 
new  standard  is  not  a  menu  of  reliability  tasks  that  one 
may  select  from  as  with  many  previous  reliability 
program  standards.  Readers  who  are  concerned  that 
their  favorite  reliability  methods  or  tools  do  not  figure 
prominently  in  the  Standard  should  not  fear.  The 
primary  mechanism  for  tailoring  GEIA-STD-0009  is 
by  selecting  “how  to”  and  best  practices  in  order  to 
implement  each  of  the  activities.  Many  of  these 
methods  and  tools  are  listed  in  Annex  A  of  the 
Standard  and  are  essential  to  the  implementation  of 




Figure 1. GEIA-STD-0009 objectives
(Flow diagram; legible labels only. Objective 1: Understand Customer/User
Requirements and Constraints. Inputs include quantitative reliability
requirements, usage and environmental profiles, cost and schedule
requirements, funding profile, known failure modes and mechanisms, and
failure definition and scoring criteria. Outputs include the initial
Reliability Program Plan, system/product-level user and environmental
profiles, initial reliability flow-down requirements, candidate reliability
trade studies, initial estimates of the loads that assemblies will
experience during the life cycle, updated reliability assessments, and the
initial and updated Reliability Requirements Verification Strategy/Plan.)


the reliability activities. A reliability scorecard developed
by AMSAA and the DoD Reliability Improvement
Working Group can be used to guide the
selection of reliability methods, tools, and best
practices. The scorecard can be found at the Defense
Acquisition University’s website by selecting the
Acquisition Community Connection, then the Reliability
& Maintainability Special Interest Area,
then tools. The current link is:
https://acc.dau.mil/CommunityBrowser.aspx?id=210483&lang=en-US

Figure 2. GEIA-STD-0009 integrated growth
(Diagram labels: identify failure modes & mechanisms; characterize
operational & environmental loads.)

It is envisioned that the developer will be tasked up-front
to draft a Reliability Program Plan, perhaps as
part of the System Engineering Plan, so that the
staffing  and  scheduling  of  the  reliability  program  will 
be  understood  and  budgeted  from  the  beginning. 
Experience  teaches  that  if  the  developer  does  not 
properly  budget  and  plan  for  the  reliability  program 
before  contract  award,  it  is  very  difficult  to  fold  it  in 
afterwards. 

Reliability  engineering  and  growth 
integrated  with  systems  engineering 

As depicted in Figure 2, GEIA-STD-0009 embodies
a systematic design-reliability-in process, not a
process  that  focuses  on  identifying  and  improving  a  few 
reliability-critical  components.  There  are  three  critical 
elements: 

1.  progressive  understanding  of  the  system-level 
operational  and  environmental  loads  and  the  resulting 
loads  and  stresses  that  occur  throughout  the  structure 
of  the  system; 

2.  progressive  identification  of  the  resulting  failure 
modes  and  mechanisms; 

3.  aggressive  mitigation  of  surfaced  failure  modes. 




Figure 3. Dynamic simulation of tactical wheeled vehicle
(Panels: Belgian Block; 6-Inch Washboard; 10-Inch Half Rounds.)


Operational  and  environmental  loads 
and  stresses 

The  reliability  that  a  system  will  demonstrate  is,  in 
part,  a  function  of  the  life-cycle  operational  and 
environmental  stresses  that  occur  throughout  the 
structure  of  the  system.  Operational  loads  result  from 
user  or  maintainer  actions  as  well  as  from  external 
systems  the  system  under  development  will  interface 
with. 

In GEIA-STD-0009, the operational and environmental
loads to be imposed on the system are
progressively characterized and designed for throughout
development. This effort starts with information
from the customer. For DoD customers, the system-level
operational and environmental loads are typically
defined by an Operational Mode Summary/Mission
Profile (OMS/MP). GEIA-STD-0009 explicitly tasks
the developer to study the OMS/MP and work with
the customer in order to obtain added details if the
OMS/MP is not specific enough for engineers to
design to. If need be, the developer will seek access to
customer assets (e.g., test courses or vehicles that the
system will be integrated with) in order to obtain the
needed specifics.

The developer progressively characterizes the resulting
loads and stresses throughout the structure, down
to components or assemblies being selected and
integrated into the design, to include commercial
off-the-shelf (COTS), nondevelopmental items (NDI),
and government-furnished equipment (GFE). It is not
possible to design reliable components, nor select and
reliably integrate COTS, NDI, and GFE, without
accurate estimates of the loads to be imposed on them.
The operational and environmental load estimates
must be verified to be operationally realistic with
measurements using the production-representative
system in time to be used for reliability verification.

The progressive characterization of loads and
stresses is routinely done by the U.S. Army. Figure 3
depicts the dynamic simulation of a tactical wheeled
vehicle traversing some of the challenging road surfaces
found at the Aberdeen Test Center. The simulation
provided loading information on various suspension
components including the A-arm. Figure 4 depicts the
instrumenting of an A-arm on an actual vehicle. The
simulation and test data were compared in order to
confirm the accuracy of the simulation model.

Figure 4. Instrumenting of A-arm

Identify  and  characterize  failure  modes 
and  mechanisms 

As depicted in Figure 2, GEIA-STD-0009 includes
a  robust  effort  to  identify  and  characterize  failure 
modes  and  mechanisms  as  soon  as  development  begins. 
This  is  essential  if  the  system  is  to  enter  subsystem  test 
with  a  level  of  reliability  that  will  lead  to  the  successful 
achievement  of  reliability  requirements. 

Teams  developing  assemblies,  subassemblies,  and 
components  for  a  system  identify  and  confirm  through 
analysis,  test,  or  accelerated  test  the  failure  modes  and 
distributions  that  will  result  when  life-cycle  operational 
and  environmental  loads  are  imposed  on  these 
assemblies,  subassemblies,  and  components.  Teams 
selecting and integrating items not specifically developed
for this system (which may include COTS, NDI,
and  GFE,  as  well  as  other  assemblies,  subassemblies, 
and  components)  identify  and  confirm  the  failure 
modes  and  distributions  that  will  result  when  these 
life-cycle  loads  are  imposed  on  these  items.  Estimates 
of  life-cycle  operational  and  environmental  loads  on 
assemblies,  subassemblies,  and  components  are  used  as 
inputs  to  engineering-based  and  physics-based  models 
in  order  to  identify  failure  mechanisms  and  the 
resulting  failure  modes. 

Figure  5  illustrates  the  A-arm  from  the  tactical 
wheeled  vehicle  depicted  in  Figures  3  and  4.  A  likely 
failure  mode  for  the  A-arm  is  that  a  crack  will  develop 
and  grow  as  a  result  of  fatigue.  Finite  element  analysis 
was  used  in  order  to  estimate  the  stresses  throughout 
the  component  that  would  result  from  the  cyclic  loads 
placed  on  it.  Figure  5  depicts  the  results,  including 
when  and  where  fatigue  failure  should  first  occur. 


Figure  5.  A-arm  fatigue  life  calculations 
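
The fatigue-life estimate described above can be illustrated with a deliberately simplified calculation. The sketch below is not the Army's analysis. It swaps in two textbook techniques, a Basquin-type S-N curve, N = A * S^(-m), and Palmgren-Miner linear damage accumulation, and every number in it (the constants A and m, the stress amplitudes, and the cycle counts per course lap) is hypothetical.

```python
def cycles_to_failure(stress_mpa, a=1.0e12, m=3.0):
    """Basquin-type S-N curve: cycles to failure at a given stress amplitude.

    a and m are hypothetical material constants, not values from the article.
    """
    return a * stress_mpa ** (-m)


def laps_to_failure(load_block):
    """Palmgren-Miner rule: failure is predicted when summed damage reaches 1.

    load_block maps stress amplitude (MPa) to cycles applied per lap.
    """
    damage_per_lap = sum(n / cycles_to_failure(s) for s, n in load_block.items())
    return 1.0 / damage_per_lap


# Hypothetical stress cycles at one A-arm location for one lap of a test course.
lap = {100.0: 1000, 200.0: 100}
print(f"predicted laps to crack initiation: {laps_to_failure(lap):.1f}")
```

In practice the stress amplitudes would come from the finite element model at each location of interest, so repeating the calculation across locations indicates where, as well as when, failure should first occur.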


Failure modes that are typically charged to operators or
maintainers are often not discovered until testing with
actual operators and maintainers begins.
GEIA-STD-0009 includes a proactive
requirement that these failure modes be
identified through analysis during system design.
Failure modes and distributions that may be induced
by manufacturing variation or errors are also to be
identified during design rather than waiting until
production. It is generally simpler and less expensive
to mitigate failure modes the earlier they are discovered.

GEIA-STD-0009 requires that all failures that
occur during accelerated, subsystem, or system testing
be analyzed until the root-cause failure mechanism has
been identified. Identification of the failure mechanism
provides the insight essential to the identification and
formulation of reliability improvements. The process of
identifying and understanding failure modes and
mechanisms continues as the design and manufacturing
processes evolve.

Failure-mode  mitigation 

The  developer  aggressively  mitigates  failure  modes 
to  ensure  the  reliability  requirements  are  successfully 
verified  and  do  not  degrade  during  production  or  in  the 
field.  Failure  modes  must  be  aggressively  mitigated 
before  subsystem  testing  begins  in  order  to  obtain  a 
reliability  level  that  will  enable  reliability  growth  to  the 
requirement  through  subsystem  and  system  testing. 
Failure  modes  are  mitigated  by  one  or  more  of  the 
following  approaches: 

•  eliminating the failure mode;

•  reducing its occurrence probability or frequency;

•  incorporating redundancy; and/or

•  mitigating failure effects (e.g., fault recovery,
degraded modes of operation, providing advance
warning of failure).




Figure 6. Reliability growth planning curve
(Plot title: PM2 - Reliability Growth Planning Curve.)


The developer submits the potential reliability
improvements identified during the execution of the
Reliability Activities to the appropriate engineering
organizations (e.g., Systems Engineering). The developer
employs a mechanism that is accessible by the
customer (e.g., a failure reporting, analysis, and
corrective action system or a data collection, analysis,
and corrective action system) for monitoring data on
the identification and mitigation of failure modes and
for communicating those data throughout the
organization. Failure modes that are expected to occur during
the system life cycle are included in the system
reliability model.

Reliability  assessment 

In GEIA-STD-0009 the term “reliability assessment”
denotes the periodic assessment of reliability progress
towards requirements. It is followed by “reliability
verification,” which denotes the formal verification that
requirements have been met. The standard establishes
seven general reliability assessment requirements:

1.  The  developer  assesses  the  reliability  of  the 
system  periodically  throughout  the  life  cycle  using 
the  system  reliability  model,  the  life-cycle  operational 
and  environmental  load  estimates  generated  from  the 
OMS/MP, and the customer-supplied Failure Definitions
and Scoring Criteria.

2.  Reliability  assessments  are  made  based  on  data 
from  analysis,  modeling  and  simulation,  test,  and  the 
field,  and  are  tracked  as  a  function  of  time  and 
compared  against  reliability  allocations  and  customer 
reliability  requirements. 


3.  For  complex  systems,  or  when  the  customer 
requires this, the assessment strategy includes reliability
values to be achieved at various points during
development. 

4.  The  developer  monitors  and  evaluates  the 
reliability  impact  of  changes  to  the  design  or 
manufacture  of  the  system. 

5.  The  implementation  of  corrective  actions  is 
verified  and  effectiveness  is  tracked. 

6.  Formal  reliability  growth  methodology  is  used 
where  applicable  (e.g.,  when  failure  modes  are 
discovered  and  addressed  with  a  test-analyze-and-fix 
process  that  is  applied  to  complex  assemblies)  in  order 
to  plan,  track,  and  project  reliability  improvement. 

7.  Predicted  failure  modes  and  mechanisms  are 
compared  with  those  from  test  and  the  field. 

The third requirement in the list above is of
particular interest for Army programs because new
policy requires that at least one intermediate
reliability-growth value be included in the request for proposals.
Such an intermediate reliability-growth value will
permit the early identification of a system that is not
on track towards meeting its reliability requirement,
which will allow time to make program adjustments
and intensify the reliability engineering and growth
process. One approach to obtaining such an intermediate
value is through the use of a reliability growth
planning model. The customer can develop a reliability
growth plan up-front based on the program schedule,
test assets, and some assumptions concerning the
intensity of the reliability growth effort.

Figure 6 depicts a notional Army reliability growth
planning curve based on the PM2 model (Ellner and
Hall 2006). Using the PM2 model, the Army
determines that the system mean time between failures
(MTBF) must grow to 1,227 hours as a point estimate in
order to have a reasonable chance (50 percent in this
case) of demonstrating 690 hours with 80 percent
statistical confidence (assuming a 10 percent drop from
DT to operational test (OT)). This plan consists of
four corrective action periods between five test events:
a Customer Test, an Initial DT, a Limited User Test, a
Low-Rate Initial Production DT, and an Initial OT.
Since this is an Army program, the Initial DT is where
the system MTBF must be demonstrated to be at least
70 percent of the requirement with 50 percent
statistical confidence. The Army can incorporate this
plan in the Request for Proposals so that the developer
can design the reliability program accordingly.
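
The arithmetic behind a goal such as this one can be sketched as follows. The article does not state the length of the Initial OT or the number of failures it allows, so the sketch assumes exponentially distributed times between failures and a fixed-length test that is passed with at most two failures; the function names are illustrative and are not part of PM2. Under those assumptions the calculation gives a DT goal close to the figure quoted above.

```python
import math


def poisson_cdf(c, mean):
    """P(N <= c) for a Poisson failure count with the given mean."""
    return sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(c + 1))


def mean_with_cdf(c, prob, lo=0.0, hi=200.0):
    """Bisection for the Poisson mean at which P(N <= c) equals prob."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(c, mid) > prob:
            lo = mid  # passing is still too likely, so the mean must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0


def dt_goal_mtbf(requirement, confidence, pass_prob, dt_to_ot_drop, allowed_failures):
    # Test length at which a system sitting exactly at the requirement
    # passes with probability 1 - confidence (the consumer risk).
    test_hours = requirement * mean_with_cdf(allowed_failures, 1.0 - confidence)
    # True OT MTBF needed to pass that test with probability pass_prob.
    ot_mtbf = test_hours / mean_with_cdf(allowed_failures, pass_prob)
    # The DT goal is set higher to absorb the assumed DT-to-OT degradation.
    return test_hours, ot_mtbf, ot_mtbf / (1.0 - dt_to_ot_drop)


# Assumption: the OT is passed with at most 2 failures (not stated in the article).
hours, ot_mtbf, dt_goal = dt_goal_mtbf(690.0, 0.80, 0.50, 0.10, allowed_failures=2)
print(f"test length {hours:.0f} h, OT MTBF {ot_mtbf:.0f} h, DT goal {dt_goal:.0f} h")
```

Changing the assumed number of allowable failures changes both the test length and the goal, which is one reason the plan is best fixed before the request for proposals is issued.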

The  fourth  reliability  assessment  requirement  in  the 
list  above  (i.e.,  monitoring  and  evaluating  the  reliability 
impact  of  changes  to  the  design  or  manufacture  of  the 
system)  is  critically  important  to  maintaining  reliability 
during  production  and  in  the  field,  and  it  may  require 
an  intensive  effort  given  the  widespread  use  of  complex 
global  approaches  to  manufacturing.  Several  methods 
for  implementing  this  requirement,  such  as  parts 
control  and  supply  chain  management,  are  identified 
in  Annex  A  of  the  standard. 

The  failure  mode  identification  and  mitigation 
activities  discussed  earlier  lead  to  a  two-part  reliability 
growth  program: 

•  Reliability growth driven by (a) engineering-based
and physics-based models, (b) accelerated
testing of low indenture-level items, (c) analyses
that identify failure modes related to manufacturing
variation and workmanship errors, and (d)
analyses that identify failure modes that are
typically charged to operators or maintainers.

•  Reliability growth driven by a test-analyze-and-fix
process applied under operationally-realistic
conditions to complex assemblies such as subsystems
and systems.

The  first  part  of  the  growth  program  provides  the 
high  starting  point  for  the  traditional  reliability  growth 
program  that  is  pivotal  to  success. 

The Reliability Assessment process consists of two
types of DT:

•  testing,  primarily  accelerated  testing,  of  low 
indenture-level  items  such  as  components  and 
noncomplex  assemblies,  in  order  to  surface  and 
mitigate  failure  modes  not  readily  identified  with 
engineering-based  and  physics-based  reliability 
modeling;  and 

•  testing  of  complex  assemblies  such  as  subsystems 
and  systems  in  order  to  surface  and  mitigate 
failure  modes  not  readily  identified  otherwise. 


Reliability  assessment  can  be  divided  into  three 
phases: 

•  assessment  of  requirements  feasibility; 

•  assessment  before  subsystem  testing  begins;  and 

•  assessment  after  subsystem  testing  begins. 

Each will be addressed in more detail.

Requirements  feasibility 

During  the  execution  of  the  first  objective,  the 
developer  must  acquire  an  understanding  of  the 
customer’s  reliability  requirements.  It  is  at  this  point 
that  an  assessment  of  the  feasibility  of  the  requirements 
is  made.  The  system  reliability  model  is  used,  in 
conjunction  with  expert  judgment,  to  assess  if  the 
design  (including  COTS,  NDI,  and  GFE)  is  capable 
of  meeting  reliability  requirements  in  the  user 
environment. If the assessment is that the customer’s
requirements are infeasible, the developer communicates
this to the customer. Clearly this is not an
analysis  of  a  design  but  is  rather  an  assessment  of 
whether  it  is  possible  for  a  new  design  to  meet 
reliability  requirements  given  previous  designs  and 
projections  of  potential  improvements. 

Assessment  before  subsystem  testing  begins 

In general, it is not possible to estimate the
reliability that a system will demonstrate under
operationally-realistic conditions until subsystem and
system testing under these conditions begins. This is
why this portion of the reliability growth curve is
dotted in Figure 2. What can be done at this stage is an
expert assessment of the quantity and quality of the
failure modes identified, and the effectiveness of the
associated mitigation. A key rule of thumb is that a
high percentage of failure modes surfaced must be
effectively mitigated in order to put the system on a
successful reliability-growth path. Mitigation effectiveness
can be evaluated in a variety of ways. If the
failure mode was identified through the use of
engineering-based or physics-based models, accelerated
testing can be used to confirm that it occurs as
expected and is well understood. It is also beneficial to
compare predicted with measured operational and
environmental loads and stresses, because these loads
are used to design reliability into new components as
well as to select and reliably integrate
COTS, NDI, and GFE.

A major pitfall to be avoided concerns predicting
system reliability under operationally-realistic conditions.
This is generally not possible before system
reliability testing begins, which can be quite frustrating.
Many programs perform handbook-based reliability
predictions, but such predictions are inaccurate because
operational reliability is largely determined by stress
and design specifics that handbook prediction models
do not accept (Pecht and Nash 1994). Reliance on
handbook predictions can lead a program to believe the
system is ready for Reliability Verification when it is
not.

Assessment  after  subsystem  testing  begins 

Estimation of system reliability can begin once
testing of subsystems or systems under operationally-realistic
conditions begins. Testing of the first
configuration establishes the initial reliability for
reliability growth tracking. The implementation of
corrective actions is verified and effectiveness is
tracked. Predicted failure modes/mechanisms are
compared with those from test and the field. Reliability
growth methodology is used to plan, track, and project
reliability based on failure data from complex assemblies
tested under operational and environmental loads.
Military Handbook 189, which is currently being
revised, may be used as a guide. One may also consult
the DoD Guide for Achieving Reliability, Availability,
and Maintainability.
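
As one illustration of reliability growth tracking, the sketch below fits the Crow (AMSAA) power-law model, a tracking model of the kind covered in Military Handbook 189, to a time-terminated test using its standard maximum-likelihood estimates. The failure times are invented for the example and the function name is illustrative.

```python
import math


def crow_amsaa_fit(failure_times, total_time):
    """MLEs for the power-law model E[N(t)] = lam * t**beta, time-terminated test."""
    n = len(failure_times)
    beta = n / sum(math.log(total_time / t) for t in failure_times)
    lam = n / total_time**beta
    # Instantaneous (current) MTBF at the end of the test.
    mtbf_now = 1.0 / (lam * beta * total_time ** (beta - 1.0))
    return beta, lam, mtbf_now


# Hypothetical cumulative test hours at which failures occurred.
times = [10.0, 30.0, 80.0, 200.0, 450.0, 900.0]
test_hours = 1000.0
beta, lam, mtbf_now = crow_amsaa_fit(times, test_hours)
print(f"beta = {beta:.3f} (a value below 1 indicates reliability growth)")
print(f"cumulative MTBF = {test_hours / len(times):.0f} h, current MTBF = {mtbf_now:.0f} h")
```

The gap between the cumulative and current MTBF is the point of tracking: averaging over the whole test understates the reliability of the latest configuration when corrective actions have been effective.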

Reliability  verification 

As mentioned earlier, in GEIA-STD-0009 the term
“reliability verification” denotes the formal verification
that requirements have been met. The standard
establishes six general reliability verification requirements:

1.  The  developer  plans  and  conducts  activities  to 
ensure  that  the  achievement  of  reliability  requirements 
is  verified  during  design. 

2.  The  developer  develops  and  periodically  refines  a 
Reliability  Requirements  Verification  Strategy/Plan 
that  is  an  integral  part  of  the  systems-engineering 
verification  and  is  coordinated  and  integrated  across  all 
phases. 

3.  The  strategy  must  further  ensure  that  reliability 
does  not  degrade  during  production  or  in  the  field. 

4.  The verification is based on analysis, modeling &
simulation, testing, or a mixture, and must be
operationally realistic.

5.  The verified system-level operational & environmental
life-cycle loads, as well as the Failure
Definitions and Scoring Criteria, must be used.

6.  Additional customer requirements, if any (e.g.,
reliability qualification testing, testing in customer
facilities, customer-controlled, customer-scored testing),
must be included.

The latter portion of reliability assessment consists
of testing activities that the DoD refers to as DT. DT
is often followed by OT to assess how well the system
will work when actual operators and maintainers use it
under field conditions. The Standard facilitates the
integration of DT and OT because the following are
required:

•  Operational  loads  (including  from  systems  that 
interface  with  the  system  under  development) 
and  environmental  loads  are  developed  based  on 
the OMS/MP, progressively refined, and eventually
verified to be accurate and operationally
realistic. 

•  Failure modes that are typically charged to
operators or maintainers are identified earlier.
Otherwise, these failure modes generally arise for the
first time during OT and result in statistically
significant differences between the DT and OT
reliability estimates.

•  System  reliability  modeling  is  developed  and 
refined  as  failure  modes  are  identified,  analyzed, 
mitigated,  and  incorporated  in  the  modeling. 

One further item must be addressed to facilitate
the estimation of reliability using both DT and OT
data: the sample sizes must be balanced so that a
statistical comparison of the reliability estimates is
credible. Even though the reliability estimates from
DT and OT may appear to be quite different, it can be
difficult to prove this statistically if either the DT or
OT sample size is too small relative to the other. One
must design the DT and OT sample sizes so they can
be credibly compared before deciding whether to
aggregate them.
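
The effect of unbalanced sample sizes can be illustrated with a simple exact test. The sketch below is one possible approach, not a method prescribed by the Standard. It assumes exponentially distributed times between failures, so that if DT and OT share the same MTBF, the number of OT failures given the total number of failures is binomial with success probability equal to the OT share of the total test time. All failure counts and test hours are hypothetical.

```python
from math import comb


def ot_worse_p_value(dt_failures, dt_hours, ot_failures, ot_hours):
    """One-sided exact p-value for 'the OT failure rate exceeds the DT rate'."""
    n = dt_failures + ot_failures
    share = ot_hours / (dt_hours + ot_hours)  # OT share of test time under H0
    # Probability of seeing at least the observed number of OT failures.
    return sum(
        comb(n, k) * share**k * (1.0 - share) ** (n - k)
        for k in range(ot_failures, n + 1)
    )


# Same point estimates in both cases: DT MTBF = 500 h, OT MTBF = 200 h.
short_ot = ot_worse_p_value(10, 5000.0, 3, 600.0)
long_ot = ot_worse_p_value(10, 5000.0, 12, 2400.0)
print(f"short OT: p = {short_ot:.3f}; longer OT: p = {long_ot:.3f}")
```

With the short OT, the apparent difference between 500 and 200 hours is not statistically significant at the conventional 5 percent level, whereas the longer OT with the same point estimate falls below it. This is the sense in which the sample sizes have to be designed together.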

MTBF-type reliability requirements are often verified
using a fixed-configuration, fixed-length test plan
from Military Handbook 781 (MH-781). One needs
the following information in order to select such a test
plan:

a)  the  MTBF  to  be  demonstrated  with  statistical 
confidence; 

b)  the  minimum  level  of  statistical  confidence  that 
a)  should  be  demonstrated  with; 

c)  the  best  pretest  estimate  of  the  actual  MTBF; 
and 

d)  the  probability  of  passing  the  test  if  c)  is  accurate. 

Figure 7. Army bridging system

In MH-781, the MTBF to be demonstrated with
statistical confidence is termed the “lower-test” MTBF,
and the minimum level of statistical confidence with
which it should be demonstrated equals one minus the
“consumer risk.” So, in order to demonstrate an MTBF
with at least 80 percent confidence, one should select a
plan with a consumer risk of 20 percent. It is items c)
and d) that are frequently misunderstood. The best
pretest estimate of the actual MTBF is termed the
“upper-test” MTBF in MH-781. This pretest MTBF
estimate must be greater than the MTBF to be
demonstrated with confidence. In order to use one of
the standard plans, the ratio of the pretest estimate to
the MTBF to be demonstrated with confidence must
be either 1.5, 2, or 3. It is unlikely that the pretest
estimate and the MTBF value to be demonstrated with
confidence will have exactly this relationship, so one
should expect to design a custom plan. The probability of
passing the test if the pretest estimate is accurate equals
one minus the “producer risk.” If one desires a test plan
where the probability of passing is 80 percent, provided
the pretest MTBF estimate is accurate, then a plan
with a producer risk of 20 percent should be selected.
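
The relationship among items a) through d) can be made concrete with a short sketch of how a custom fixed-length plan might be derived. This is an illustration under stated assumptions, not the MH-781 procedure: it assumes exponentially distributed times between failures and a time-truncated test in which failed items are repaired or replaced, so that the number of failures in T test hours is Poisson with mean T divided by the true MTBF. All function and parameter names are hypothetical, and because the sketch enforces both risks strictly, its results may differ slightly from tabulated standard plans.

```python
from math import exp


def poisson_cdf(c, mean):
    """P(N <= c) for N ~ Poisson(mean)."""
    term = exp(-mean)
    total = term
    for k in range(1, c + 1):
        term *= mean / k
        total += term
    return total


def accept_probability(true_mtbf, test_hours, max_failures):
    """Probability of passing: at most max_failures failures in test_hours."""
    return poisson_cdf(max_failures, test_hours / true_mtbf)


def design_fixed_length_plan(lower_mtbf, upper_mtbf, consumer_risk,
                             producer_risk, max_c=200):
    """Return (test_hours, max_failures) for the shortest plan found.

    lower_mtbf    -- item a): MTBF to be demonstrated with confidence
    consumer_risk -- one minus item b), the required confidence
    upper_mtbf    -- item c): best pretest estimate of the actual MTBF
    producer_risk -- one minus item d), the desired probability of passing
    """
    if upper_mtbf <= lower_mtbf:
        raise ValueError("pretest estimate must exceed the MTBF to be demonstrated")
    if not (0 < consumer_risk < 1 and 0 < producer_risk < 1):
        raise ValueError("risks must lie strictly between 0 and 1")
    for c in range(max_c + 1):
        # Find, by bisection, the test length at which a system whose MTBF
        # equals lower_mtbf passes with probability equal to consumer_risk.
        lo, hi = 0.0, float(lower_mtbf)
        while accept_probability(lower_mtbf, hi, c) > consumer_risk:
            hi *= 2
        for _ in range(100):
            mid = (lo + hi) / 2
            if accept_probability(lower_mtbf, mid, c) > consumer_risk:
                lo = mid
            else:
                hi = mid
        # Keep this plan only if a system as good as the pretest estimate
        # fails the test with probability no greater than producer_risk.
        if 1 - accept_probability(upper_mtbf, hi, c) <= producer_risk:
            return hi, c
    raise ValueError("no plan found within max_c allowable failures")


if __name__ == "__main__":
    # Hypothetical case with a non-standard ratio of 1.7.
    hours, c = design_fixed_length_plan(100, 170, 0.20, 0.20)
    print(f"test {hours:.0f} hours; pass with at most {c} failures")
```

As the pretest estimate approaches the MTBF to be demonstrated, the required test length grows rapidly, which is one reason a plan chosen without regard to items c) and d) can be unlikely to be passed.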

Many practitioners do not understand how to select
an MH-781 test plan as just described, which can result
in the selection of a test plan that is unlikely to be
passed. It is expected that MH-781, which is currently
under revision, will be edited so that the logic
described above is clearer. Regardless of the
test-planning resource used, fixed-configuration,
fixed-length test plans must be selected using items a)
through d) so that the system will be highly likely to
pass if it is as reliable as the developer believes and will
be highly unlikely to pass if the MTBF is below the
requirement.

In some cases it is impossible to rely exclusively on a
reliability demonstration, and a mix of modeling,
analysis, and test may be needed. One example is a
mobile Army bridging system, pictured in Figure 7,
that can span gaps of up to 12 meters. Historically, the
cost and time associated with conducting large-scale
bridge crossing tests precluded testing the requirement
fully enough to achieve statistical confidence. To solve
this problem, the Aberdeen Test Center developed the
Bridge Crossing Simulator, a device that physically
simulates the loads imposed by a crossing vehicle on a
bridge under test, allowing durability testing to be
conducted quickly and economically. While the Army
bridging system was under test on the Bridge Crossing
Simulator, a problem developed: the bridge center
coupler connection failed before the bridge had
reached its required durability life. Army Materiel
Systems Analysis Activity (AMSAA) engineers used a
physics-based computer modeling analysis technique,
PoF, to identify the root causes of the failure and to
recommend a design improvement.

The recommendation was to add structural
angle sections to connect the center couplings of the
bridge to the vertical webs, which would create a much
stronger double-shear connection. The new design
proposal  eliminated  the  weak  spot  in  the  weld  between 
the  bridge  bottom  flange  and  vertical  web  where  the 
previous  failure  had  originated.  U.S.  Army  Tank 
Automotive  Research,  Development  and  Engineering 
Center  (TARDEC)  engineers,  along  with  the  Product 
Manager  (PM)  Assured  Mobility  Systems,  reviewed 
the  results  of  an  upgrade  feasibility  study  performed  by 
the  bridging  system  prime  contractor  to  address 
increased  requirements.  They  determined  that  the 
suggested design improvement might not only fix the
immediate problem, but would also provide the
additional  margin  needed  to  upgrade  the  bridge’s  load 
capacity.  The  PM  Bridging,  located  in  the  Program 
Management  Office  Force  Projection,  capitalized  on 
the  confluence  of  events  and  moved  forward  to  upgrade 
the  bridge.  TARDEC  and  the  bridging  system 
contractor worked to implement the AMSAA
recommendation and to add other enhancements to ensure
that the system would meet the new, tougher
requirements. After further testing, the bridging system
finished the durability testing with a few cracks, but
none that would impact the operational mission.

Summary 

The DoD is working closely with the GEIA on the
development of GEIA-STD-0009, Reliability Program
Standard for Systems Design, Development, and
Manufacturing, at the behest of the Defense Science Board.
It is hoped that GEIA-STD-0009 will improve the
odds that military systems will successfully demonstrate
reliability requirements in both DT and OT. This
article provides an overview of GEIA-STD-0009,
along with initial guidance on its application, with an
emphasis on the assessment and verification of system
reliability.

Dr. Michael J. Cushing is director (acting) of the U.S.
Army Evaluation Center's Reliability & Maintainability
Directorate. He earned a bachelor of science degree in
electronics engineering and computer science from the Johns
Hopkins University and both master of science and Ph.D.
degrees in reliability engineering from the University of
Maryland at College Park. During 25 years in U.S.
military reliability, he has authored numerous publications,
helped formulate and implement a variety of reliability
policies, and contributed towards several reliability
standards. Current activities include being a member of
both the U.S. Department of Defense Reliability
Improvement Working Group and the GEIA-STD-0009
Working Group. E-mail: michael.cushing@us.army.mil

Margaret Hockenberry is a senior operations research
analyst in charge of the Reliability Analysis team for the
U.S. Army Materiel Systems Analysis Activity (AMSAA).
Her team at AMSAA is executive agent for Reliability and
Maintainability (RAM) Standardization Reform and
Standards Lead for the U.S. Army Research Development
and Engineering Command and the Army. Her focus has
been on updating reliability handbooks in order to provide
the latest methodology. She is a member of the
GEIA-STD-0009 Working Group and coauthor of section three
of the standard. Hockenberry holds a bachelor of arts degree
in mathematics and a master of business administration
degree from the Florida Institute of Technology. E-mail:
margaret.hockenberry@us.army.mil

E.  Andrew  Long  has  over  two  decades  of  experience  in 
a  broad  range  of  systems  analysis  problems  in  performance 
and  logistics  reliability.  He  has  participated  in  several 
efforts  to  help  programs  address  supportability  issues,  and 
has  performed  both  theoretical  and  applied  studies  of 
reliability  and  availability.  Recent  work  includes  logistics 
cost  realism  analysis  for  the  Coast  Guard’s  Integrated 
Deepwater  System.  He  performed  logistics  modeling  and 
footprint  analysis  for  the  Army’s  Future  Combat  Systems 
Program.  Currently,  he  supports  the  OSD  Director, 
Operational  Test  and  Evaluation  on  reliability  issues 
related  to  suitability.  E-mail:  andy.long.ctr@osd.mil 

References 

DoD  Guide  to  Achieving  Reliability,  Availability,  and 
Maintainability.  August  3,  2005. 

Ellner,  P.  M.  and  Hall,  J.  B.  2006.  Planning  model 
based  on  projection  methodology  (PM2).  Aberdeen 
Proving  Ground  (MD):  U.S.  Army  Materiel  Systems 
Analysis  Activity.  Technical  Report  No.  TR-2006-9. 

GEIA-STD-0009. 2008 (expected). Reliability Program
Standard for Systems Design, Development, and
Manufacturing.

Military  Handbook  189.  1981.  Reliability  Growth 
Management. 

Military Handbook 781. 1996. Handbook for Reliability
Test Methods, Plans, and Environments for
Engineering, Development, Qualification, and
Production.

Pecht,  M.  G.  and  Nash,  F.  R.  1994.  “Predicting  the 
reliability  of  electronic  equipment.”  Proceedings  of  the 
IEEE.  pp.  992-1004. 




